Semi-Automatic Acquisition of Domain-Specific Translation Lexicons
Abstract
We investigate the utility of an algorithm for translation lexicon acquisition (SABLE), used previously on a very large corpus to acquire general translation lexicons, when that algorithm is applied to a much smaller corpus to produce candidates for domain-specific translation lexicons.

1 Introduction

Reliable translation lexicons are useful in many applications, such as cross-language text retrieval. General-purpose machine-readable bilingual dictionaries are sometimes available, and some methods have been proposed for acquiring translation lexicons automatically from large corpora. Less attention has been paid to the problem of acquiring bilingual terminology specific to a domain, especially given domain-specific parallel corpora of only limited size.

In this paper, we investigate the utility of an algorithm for translation lexicon acquisition (Melamed, 1997), used previously on a very large corpus to acquire general translation lexicons, when that algorithm is applied to a much smaller corpus to produce candidates for domain-specific translation lexicons. The goal is to produce material suitable for postprocessing in a lexicon acquisition process like the following:

1. Run the automatic lexicon acquisition algorithm on a domain-specific parallel corpus.
2. Automatically filter out "general usage" entries that already appear in a machine-readable dictionary (MRD) or other general usage lexical resources.
3. Manually filter out incorrect or irrelevant entries from the remaining list.

Our aim, therefore, is to achieve sufficient recall and precision to make this process, in particular the time and manual effort required in Step 3, a viable alternative to manual creation of translation lexicons without automated assistance.
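As a rough illustration of Step 2 only, the automatic filter can be sketched as a lookup of each candidate pair in a general-usage dictionary. This is a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `filter_general_usage`, the representation of the MRD as a mapping from a source word to a set of known translations, and the toy English/French entries are all hypothetical.

```python
def filter_general_usage(candidates, mrd):
    """Drop candidate pairs that a general-usage dictionary already lists.

    candidates: iterable of (source_word, target_word) pairs, assumed to be
                the output of the automatic acquisition step (Step 1).
    mrd:        dict mapping a lowercased source word to a set of lowercased
                known translations (a hypothetical stand-in for a real MRD).
    """
    remaining = []
    for source, target in candidates:
        known_translations = mrd.get(source.lower(), set())
        if target.lower() not in known_translations:
            # Not covered by the general lexicon: keep for manual review (Step 3).
            remaining.append((source, target))
    return remaining


# Toy data, invented for illustration only.
candidates = [("house", "maison"), ("file", "fichier"), ("window", "fenêtre")]
mrd = {"house": {"maison"}, "window": {"fenêtre", "vitre"}}

remaining = filter_general_usage(candidates, mrd)
```

Only the pairs left in `remaining` would be passed on to the manual filtering of Step 3, so the size of this list is what determines the human effort the process requires.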
The literature on cross-lingual text retrieval (CLTR) includes work that is closely related to this research, in that recent approaches emphasize the use of dictionary- and corpus-based techniques for translating queries from a source language into the language of the document collection (Oard, 1997). Davis and Dunning (1995), for example, generate target-language queries using a corpus-based technique that is similar in several respects to the work described here. However, their approach does not attempt to distinguish domain-specific from general usage term pairs, and it involves no manual intervention. The work reported here, which focuses on semi-automating the process of acquiring translation lexicons specific to a domain, can be viewed as providing bilingual dictionary entries for CLTR methods like that used by Davis in later work (Davis, 1996), in which dictionary-based generation of an ambiguous target-language query is followed by corpus-based disambiguation of that query.

Turning to the literature on bilingual terminology identification per se, monolingual terminology extraction is a problem that has been previously explored, often with respect to identifying relevant multi-word terms (e.g., Daille, 1996; Smadja, 1993), but less prior work exists on bilingual acquisition of domain-specific translations. Termight (Dagan and Church, 1994) is one method for analyzing parallel corpora to discover translations in technical terminology; Dagan and Church report accuracy of 40% given an English/German technical manual, and observe that even this relatively low accuracy permits the successful application of the system in a translation bureau, when used in conjunction with an appropriate user interface.
The Champollion system (Smadja, McKeown, and Hatzivassiloglou, 1996) moves toward higher accuracy (around 73%) and considerably greater flexibility in the handling of multi-word translations, though the algorithm has been applied primarily to very large corpora such as the Hansards (3-9 million words; Smadja et al. observe that the method has difficulty handling low-frequency cases), and no